Fixes Indexed assignment of extra columns in SolutionArray #838

sin-ha · 2020-04-01T13:15:36Z

Checklist

There is a clear use-case for this code change
The commit message has a short title & references relevant issues
Build passes (scons build & scons test) and unit tests address code coverage
The pull request is ready for review

If applicable, fill in the issue number this pull request is fixing

Fixes #829

Changes proposed in this pull request

- Previously, self._extra[name] was initialised as a list, but when the set item method was called, an np.array() copy of that list was returned.
- I propose initialising self._extra[name] as an np.array in the __init__ method, and I have amended the other methods to match.
- I think this is a better implementation.

This fixes item assignment, because the original object is passed instead of a copy.
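The copy problem described above can be reproduced outside Cantera. The following is a minimal sketch, assuming the old accessor wrapped the stored list in np.array() on every access; ListBacked and ArrayBacked are hypothetical stand-ins, not Cantera classes.

```python
import numpy as np


class ListBacked:
    """Hypothetical stand-in: stores a list, hands out an ndarray copy."""

    def __init__(self, n):
        self._foo = [0] * n

    @property
    def foo(self):
        # np.array() on a list builds a new array on every access
        return np.array(self._foo)


class ArrayBacked:
    """Hypothetical stand-in: stores the ndarray and returns it directly."""

    def __init__(self, n):
        self._foo = np.array([0] * n)

    @property
    def foo(self):
        return self._foo


old = ListBacked(3)
old.foo[1] = 5  # writes into a temporary copy, which is then discarded
new = ArrayBacked(3)
new.foo[1] = 5  # writes into the stored array

print(old.foo.tolist(), new.foo.tolist())
```

With the list-backed variant the assignment is silently lost, which matches the symptom reported in #829.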

ischoegl · 2020-04-11T02:00:45Z

interfaces/cython/cantera/composite.py

@@ -523,9 +523,9 @@ def __init__(self, phase, shape=(0,), states=None, extra=None):
                        "Unable to create extra column '{}': name is already "
                        "used by SolutionArray objects.".format(name))
                if not np.shape(v):
-                    self._extra[name] = [v]*self._shape[0]
+                    self._extra[name] = np.array([v]*self._shape[0])


I know this is a pre-existing choice, but it may make sense to allow for more dimensions? I.e., this would be a case where we have something like

>>> states = ct.SolutionArray(gas, (6, 10), extra=['foo', 'bar'])

I don't quite understand what you mean. I guess you want foo and bar to be extra arrays of shape (6, 10)? Like

>>> states = ct.SolutionArray(gas, (6, 10), extra={'foo': sd, 'bar': sd})

where sd is any (6, 10) list or NumPy array?

@sin-ha ... what I wrote was deliberate - no typo there. It currently throws a ValueError, but you could argue that it should initialize np.ndarray's with the full _shape, not just the first dimension. From my perspective, the shape of the _extra entries should be consistent with other data.

@ischoegl I'm also not sure what you're suggesting, but it seems like it would be a new feature? In that case, I think it seems outside the scope of this fix.

@ischoegl As of now, extra expects a dict unless the shape of the SolutionArray is trivial.

If the user passes a list, we should be able to initialise the extra arrays with the same shape.

Otherwise the user passes a dict, and in that case we should validate extra against the complete SolutionArray shape instead of self._shape[0].

I also think this adds new functionality and is a bit outside the issue, so should it be included here or should we open another issue?

I think the point I raised is a trivial addition, and the application was stated in my first post: it’s a reasonable assumption that the arrays are instantiated with correct dimensions based on the information already provided. My guess is that the former single-dimensioned approach was based on _extra formerly being list-based. Now that this is no longer the case, putting in a more consistent approach is trivial.

What I am talking about affects some of the same lines that are changed already and won’t take more than a couple of additional lines. Hence, I’d like to see it considered as a friendly amendment of the original issue 😉
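To illustrate the shape question raised in this thread, here is a small sketch that compares the list-based initialisation from the diff with an array filled to the full shape. The variable names are illustrative only, and np.full() is used here simply because it appears later in this PR.

```python
import numpy as np

shape = (6, 10)  # shape passed to SolutionArray in the example above
v = 0.0          # scalar initial value for one extra column

# Initialisation as in the diff: only the first dimension is used
first_dim_only = np.array([v] * shape[0])

# Initialisation with the complete shape
full_shape = np.full(shape, v)

print(first_dim_only.shape, full_shape.shape)
```

Only the second array has the same shape as the other data stored for a (6, 10) SolutionArray.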

ischoegl · 2020-04-13T18:56:48Z

interfaces/cython/cantera/composite.py

+        if isinstance(extra,str):
+            extra = [extra]
+
+        if isinstance(extra, list) and all(isinstance(name, str) for name in extra):


That captures almost everything! However, extra could also be a tuple or an ndarray of strings. I believe all of these are supported when no shape is used to instantiate SolutionArray (i.e., I am talking about the former elif extra and self._shape == (0,): case, in which any of the above can be iterated over).

That'd be a quick fix, but I don't think np.arrays are important?
Besides, I believe the other branch, where self._shape is trivial, should be removed: after this commit it would only be a last-ditch attempt to map something in extra, which is probably not useful (is there a particular use case?). It lacks checks, so something odd like integers can pass, which can't be used afterwards, e.g.

>>> states3 = ct.SolutionArray(gas, shape=(0,0,0), extra=[1])

I don’t think that the alternative case can be executed for useful content either (you already caught all useful cases), i.e., I believe it can be deleted. If I’m not mistaken, the unit tests should still pass.

On a formatting note, there are a couple of spaces missing after commas (e.g. it should be shape=(0, 0, 0) in the previous post; same for some other lines you edited). It’s minor, but the code should remain in compliance with PEP 8.

PS: ad ndarrays ... again it’s minor but it should be included for consistency.

Yes, the unit tests pass.
Sorry about the formatting.
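The cases discussed in this thread (a bare string, a list, a tuple, or an ndarray of strings) could be normalised in one place. This is only a sketch under that assumption; normalize_extra_names is a hypothetical helper, not a function in composite.py, and the error messages are made up for illustration.

```python
import numpy as np


def normalize_extra_names(extra):
    """Return a list of column names from a str, list, tuple, or 1-D ndarray."""
    if isinstance(extra, str):
        # A bare string is a single column name, not an iterable of characters
        return [extra]
    if isinstance(extra, np.ndarray):
        if extra.ndim != 1:
            raise ValueError("Extra column names must be a one-dimensional array")
        extra = extra.tolist()
    if isinstance(extra, (list, tuple)) and all(isinstance(n, str) for n in extra):
        return list(extra)
    raise ValueError("Extra column names must be strings")


print(normalize_extra_names("foo"))
print(normalize_extra_names(("foo", "bar")))
print(normalize_extra_names(np.array(["foo", "bar"])))
```

Rejecting non-string entries up front also covers the extra=[1] case mentioned above.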

bryanwweber

Hi @sin-ha! Thanks for taking this on. I've got several comments below; I think we need to change how this is implemented. Let me know your thoughts, or if you have any questions or concerns!

bryanwweber · 2020-05-29T18:41:30Z

interfaces/cython/cantera/composite.py

@@ -590,7 +596,7 @@ def append(self, state=None, **kwargs):
            raise IndexError("Can only append to 1D SolutionArray")

        for name, value in self._extra.items():
-            value.append(kwargs.pop(name))
+            np.append(value,kwargs.pop(name))


Suggested change

-            np.append(value,kwargs.pop(name))
+            np.append(value, kwargs.pop(name))

But this isn't correct: np.append does not work in place. See the documentation: https://numpy.org/doc/1.18/reference/generated/numpy.append.html. This will most likely be very slow, since NumPy allocates a new array every time np.append() is called. I wonder if it would be better to cast the values in _extra to lists, use the list append method, and then cast them back to arrays? Maybe the complexity isn't worth it, though.

I also can't find any other way around see

What do you mean by "any other way around"? You'll have to find some way, since this won't work as expected. I think the simplest method is just to reassign value: value = np.append(value, kwargs.pop(name)). That might be slow, since NumPy has to create a new array every time, but hopefully not too slow.
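A short sketch of why the np.append() call in the diff has no effect. It also shows an assumption worth checking in the real code: inside a loop over dict.items(), rebinding the loop variable does not update the dictionary either, so the new array has to be stored back under its key. The dictionaries here are illustrative, not the actual SolutionArray internals.

```python
import numpy as np

kwargs = {"foo": 3}

# 1. Result discarded: np.append() returns a new array, the stored one is unchanged
extra = {"foo": np.array([1, 2])}
for name, value in extra.items():
    np.append(value, kwargs[name])
after_discard = extra["foo"].tolist()

# 2. Rebinding the loop variable only changes a local name
extra = {"foo": np.array([1, 2])}
for name, value in extra.items():
    value = np.append(value, kwargs[name])
after_rebind = extra["foo"].tolist()

# 3. Storing the new array back under its key updates the dictionary
extra = {"foo": np.array([1, 2])}
for name in list(extra):
    extra[name] = np.append(extra[name], kwargs[name])
after_store = extra["foo"].tolist()

print(after_discard, after_rebind, after_store)
```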

bryanwweber · 2020-05-29T18:55:37Z

I forgot to review the test file as well! Sorry, my main comment there was that I think each time you have a comment that delimits a different set of tests, that should be a new test function.

bryanwweber · 2020-06-29T11:23:37Z

@sin-ha Are you going to be able to address my comments? Thanks 😄

sin-ha · 2020-06-29T11:30:30Z

Yes @bryanwweber, I was busy with my never-ending university exams and forgot about it.
Sorry for the late reply. I will look into everything ASAP.

bryanwweber · 2020-06-29T14:47:44Z

Thanks @sin-ha I certainly understand having lots of exams 😊

bryanwweber

Thanks for the changes @sin-ha, just a few things left!

sin-ha · 2020-07-02T17:21:50Z

Done, @bryanwweber.
Meanwhile, may I also add my name to Cantera/AUTHORS? I don't know if I qualify, though 😅

bryanwweber

Thanks @sin-ha. One change still to do, and please feel free to add yourself to AUTHORS!

bryanwweber · 2020-07-02T17:40:21Z

interfaces/cython/cantera/composite.py

@@ -608,7 +623,7 @@ def append(self, state=None, **kwargs):
            raise IndexError("Can only append to 1D SolutionArray")

        for name, value in self._extra.items():
-            value.append(kwargs.pop(name))
+            np.append(value, kwargs.pop(name))


This line still isn't fixed.

I didn't think this through before. Anyway:

I think the simplest method is just to reassign value: value = np.append(value, kwargs.pop(name))

Appending to np.ndarrays is not as fast.

It turns out that converting them to lists and appending (as you suggested before) does a better job!

See https://stackoverflow.com/a/22394181 and https://stackoverflow.com/questions/7133885/fastest-way-to-grow-a-numpy-numeric-array

In [2]: %%timeit
   ...: a = np.empty((0,3), int)
   ...: l = list(a)
   ...: for i in range(1000):
   ...:     l.append([3*i+1,3*i+2,3*i+3])
   ...: l = np.array(l)
467 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %%timeit
   ...: a = np.array([3,4,3])
   ...: for i in range(1000):
   ...:     l = list(a)
   ...:     l.append([3*i+1,3*i+2,3*i+3])
   ...:     l = np.array(l)
   ...:
2.27 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %%timeit
   ...: a = np.empty((0,3), int)
   ...: for i in range(1000):
   ...:     a = np.append(a, 3*i+np.array([[1,2,3]]), 0)
   ...:
3.88 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
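For anyone who wants to repeat the comparison, the two growth strategies can be wrapped in functions and timed with the standard timeit module. This is a sketch with illustrative function names; no timings are quoted here because they depend on the machine and the NumPy version.

```python
import timeit

import numpy as np


def grow_with_list(n):
    """Collect rows in a Python list and convert to an array once at the end."""
    rows = []
    for i in range(n):
        rows.append([3 * i + 1, 3 * i + 2, 3 * i + 3])
    return np.array(rows)


def grow_with_np_append(n):
    """Grow the array with np.append(), which allocates a new array each time."""
    a = np.empty((0, 3), int)
    for i in range(n):
        a = np.append(a, [[3 * i + 1, 3 * i + 2, 3 * i + 3]], axis=0)
    return a


if __name__ == "__main__":
    for func in (grow_with_list, grow_with_np_append):
        seconds = timeit.timeit(lambda: func(1000), number=10)
        print(func.__name__, seconds)
```

Both functions produce the same array, so only the cost of growing it differs.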

bryanwweber · 2020-07-06T18:05:46Z

@sin-ha I pushed a few more commits to this branch to fix a few edge cases I found in some testing. For your reference, the changes are in commits b86786b, 1cc53b4, and 8786b51. The other commits fix a few other things I found while working on this code.

The main things to watch out for are:

- Make sure you test all the branches of an if/else conditional. There was a ValueError if you tried to create an extra item from only a single value (for example, extra={"prop1": 1}) due to the way you were using .reshape().
- Similarly, if you make a change, make sure to add a test to make sure that it works. There was no test of the append() for extra items. To be fair, there wasn't one before either... Adding a test caused me to add the fixes in 5524e99, 477cbed, and ef1fa99.
- It's helpful to name test functions descriptively. You never actually have to call them anywhere, so using up the entire line with the function name isn't really a problem (sorry, I should have pointed this out before).
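The single-value failure mentioned above can be shown with plain NumPy. The sketch below assumes the failing code reshaped a one-element array to the target shape; the test function names are illustrative and are not the ones in test_thermo.py.

```python
import numpy as np

shape = (3, 10)


def test_extra_from_single_value_fills_full_shape():
    # np.full() broadcasts a scalar to every element of the requested shape
    filled = np.full(shape, 1)
    assert filled.shape == shape
    assert (filled == 1).all()


def test_reshaping_single_value_raises_value_error():
    # A one-element array cannot be reshaped into 30 elements
    try:
        np.array([1]).reshape(shape)
    except ValueError:
        return
    raise AssertionError("expected ValueError from reshape")


test_extra_from_single_value_fills_full_shape()
test_reshaping_single_value_raises_value_error()
```

Each function covers one branch and its name states what is being checked.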

bryanwweber · 2020-07-06T18:37:25Z

The test failures are due to using NumPy 1.11. The docs for .full() are: https://docs.scipy.org/doc/numpy-1.11.0/reference/generated/numpy.full.html#numpy.full

The default dtype for .full() in 1.11 is float, which means string values can't be included in extra arrays without specifying a different dtype on creating. NumPy 1.12 and up use np.array(fill_value).dtype as the default, so don't have this problem. I guess if we want to support NumPy 1.11 (which is the version available in the Ubuntu 16.04 repositories), we'll have to specify the dtype for all the extra arrays...
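One possible workaround, shown only as a sketch, is to pass the dtype explicitly so that the result does not depend on the default of np.full(). The helper name full_like_value is hypothetical; this is not what the PR ended up doing.

```python
import numpy as np


def full_like_value(shape, fill_value):
    """Fill an array using the dtype NumPy infers for the fill value itself."""
    return np.full(shape, fill_value, dtype=np.array(fill_value).dtype)


labels = full_like_value((2,), "fuel")
counts = full_like_value((2,), 1)

print(labels.dtype.kind, counts.dtype.kind)
```

The dtype kinds are "U" for the string column and "i" for the integer column.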

ischoegl · 2020-07-09T18:49:41Z

The test failures are due to using NumPy 1.11. The docs for .full() are: https://docs.scipy.org/doc/numpy-1.11.0/reference/generated/numpy.full.html#numpy.full

The default dtype for .full() in 1.11 is float, which means string values can't be included in extra arrays without specifying a different dtype on creating. NumPy 1.12 and up use np.array(fill_value).dtype as the default, so don't have this problem. I guess if we want to support NumPy 1.11 (which is the version available in the Ubuntu 16.04 repositories), we'll have to specify the dtype for all the extra arrays...

Ubuntu 16.04 support ends April 2021, which is less than a year out. Citing a conservative (albeit current) distro, CentOS8 uses numpy 1.14. GitHub's CI still uses 18.04 as ubuntu-latest, but it likely won't be long before it is upgraded to 20.04. I guess it boils down to a similar question as #891, where it's only a system-installed version that will cause issues?

PS: It appears that numpy requirements are actually not strictly enforced in SConstruct, and there’s only a warning issued?

bryanwweber · 2020-07-10T18:45:02Z

Thanks for the thoughts on NumPy @ischoegl. It is really good to know that a conservative distribution like CentOS includes 1.14. Until this PR, there has not been a required minimum version of NumPy. We have been able to be flexible because we hadn't used very many of the user-interface features (basically, we just used the C headers to provide arrays in Cython). Hence, it was only necessary to provide a warning for versions of NumPy that we hadn't tried. Going forward, it seems it may be necessary to enforce an actual minimum version.

ischoegl · 2020-07-10T19:54:28Z

ad numpy ... Ubuntu 18.04 is still on 1.13.3, so this still puts some restrictions on new features, e.g. #896. I ended up testing for distutils.version.LooseVersion in #900 ...
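A version gate like the one mentioned here does not need distutils, which was removed from the standard library in Python 3.12. The following is a sketch only: version_tuple and meets_minimum are hypothetical helpers, and the (1, 12) threshold is the NumPy 1.12 boundary discussed above.

```python
import re

MIN_NUMPY = (1, 12)  # threshold taken from the discussion above


def version_tuple(version, parts=2):
    """Return the leading numeric components of a version string as integers."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:parts])


def meets_minimum(version, minimum=MIN_NUMPY):
    """Compare the leading components of a version string against a minimum."""
    return version_tuple(version, len(minimum)) >= minimum


for candidate in ("1.11.0", "1.12.0", "1.13.3"):
    print(candidate, meets_minimum(candidate))
```

In real code the string would come from np.__version__.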

The minimum NumPy version is 1.12 to support the use of numpy.full() in the SolutionArray interface introduced in Cantera#838.

speth

From reading the discussion, I think all of the previously-mentioned issues have been hashed out at this point, right? I just had one case where I think the behavior should change.

speth · 2020-08-21T21:27:43Z

interfaces/cython/cantera/composite.py

-            raise ValueError("Initial values for extra properties must be "
-                             "supplied in a dict if the SolutionArray is not "
-                             "initially empty")
+                self._extra[name] = np.empty(self._shape)


I don't think the 'extra' arrays should be allowed to contain uninitialized values. The existing behavior of requiring initial values if the SolutionArray is not empty should be preserved.

@speth ... I’d ask to consider retaining this behavior (while potentially avoiding np.empty).

Think of a scenario where extra is calculated based on the content of the SolutionArray itself (which typically receives the actual value using some setter only after it is instantiated, so the content of extra is not known a priori). For this case, you’d have to provide dummy values to create extra columns, which would be subsequently overwritten. I believe that automatically initializing values as zeros would be a lot more convenient (and avoid np.empty); columns that contain strings are presumably rare, so having to provide a value there would be ok.

@ischoegl I originally had the same objection as you, but I considered it for a while and I ended up agreeing with @speth. There are two main reasons for this, as I see it:

It is trivially easy to specify a dummy value, since you do not need to specify the shape. For instance:

>>> arr = ct.SolutionArray(gas, (3, 10), extra={"prop": 0, "prop2": 0.0})
>>> arr.prop
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
>>> arr.prop2
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
>>> # This is not allowed right now, but would be almost the same, except the dtype
>>> arr = ct.SolutionArray(gas, (3, 10), extra=["prop", "prop2"])

Since I've used np.full() to fill the initial value, there's no need to give an array as the initial value.

It isn't clear what dtype should be used - it is not only strings, but whether one wants to use floats vs. ints or any other dtype as well. I suppose floats could be used in place of ints, but there are legitimate use cases for the int dtype, for instance, as a categorical or level variable. I think it would also not be clear why one would have to specify an initial value for some dtypes but not for others.

@bryanwweber ... I didn’t think of the scalar route, so your argument is convincing.

The availability of the scalar initializer is a nice compromise for the case where the value is going to be calculated anyway.

If I'm reading this block of code correctly now, the only time that this line can be called is if self._shape is (0,), right? It might be more clear if the value was just set to self._extra[name] = np.empty(0) then.

Yes, that's correct, it only runs if self._shape == (0,). I'm not sure it's clearer to just pass 0 though. Shapes of NumPy arrays are always returned as tuples, and just writing np.empty(0) implies to me that 0 will be the value in the array (notwithstanding the name empty), not that there's a shape of 0. So I think it would be better to use np.empty(shape=(0,)), which is then just self._shape, and putting in (0,) I thought would be more confusing than using self._shape, because the question arises, if these are equal why not just use the existing variable name?

I think explicitly showing the size as zero (with whatever notation you like) is preferable because it makes it clear that this doesn't result in an array with uninitialized values. Tracing the logic back to see that self._shape is (0,) if we're in this branch is not really that simple.

OK, this is fixed now with shape=(0,) 👍
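The outcome of this thread can be summarised in a small sketch: names without values are only accepted for an empty array, and the columns are then created with an explicit zero size so that no uninitialized values exist. init_extra_columns is a hypothetical helper; the error text is the one from the diff above.

```python
import numpy as np


def init_extra_columns(shape, names):
    """Create empty extra columns from names only; reject non-empty shapes."""
    if shape != (0,):
        raise ValueError("Initial values for extra properties must be "
                         "supplied in a dict if the SolutionArray is not "
                         "initially empty")
    # shape=(0,) makes it explicit that the array holds no elements at all
    return {name: np.empty(shape=(0,)) for name in names}


columns = init_extra_columns((0,), ["foo", "bar"])
print({name: col.size for name, col in columns.items()})
```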

Fix the formatting of the two KeyErrors in SolutionArray.append(). The first one was missing the format method on the string. The second was raising a double-KeyError due to the way the except is handled.

When appending to a SolutionArray, any extra values must be specified as kwargs. If they aren't present, the array lengths would become out of sync, or there would be a KeyError on the kwargs dictionary. Also adds a test for the failure modes of SolutionArray.append().

Move append of extra values to the end of SolutionArray.append(). If the append is done at the beginning of the function, the append can happen even if the state is invalid. This would cause the length of the arrays to become out of sync.
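The ordering described in these commit messages can be sketched as follows: check the keyword arguments and the state first, and only then modify any array, so that a failure leaves all lengths unchanged. TinyArray is a hypothetical toy class, not SolutionArray, and its state check is a placeholder.

```python
import numpy as np


class TinyArray:
    """Toy container with one list of states and named extra columns."""

    def __init__(self, extra_names):
        self.states = []
        self._extra = {name: np.empty(shape=(0,)) for name in extra_names}

    def append(self, state, **kwargs):
        # Validate everything before touching any stored data
        missing = [name for name in self._extra if name not in kwargs]
        if missing:
            raise KeyError("Missing extra value(s): {}".format(missing))
        unknown = [name for name in kwargs if name not in self._extra]
        if unknown:
            raise KeyError("Unknown extra value(s): {}".format(unknown))
        if state is None:  # placeholder for a real validity check
            raise ValueError("Invalid state")

        # All checks passed: extend the states and every extra column together
        self.states.append(state)
        for name in self._extra:
            self._extra[name] = np.append(self._extra[name], kwargs[name])


arr = TinyArray(["foo"])
arr.append(1.0, foo=2)
try:
    arr.append(None, foo=3)  # invalid state, nothing should be appended
except ValueError:
    pass
print(len(arr.states), arr._extra["foo"].size)
```

After the failed call, the states and the extra column both still have length 1.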

If a single value (e.g., integer or float) is passed as the value for an extra column, the array with the single value cannot be reshaped. Use np.full() instead. Add/rename tests for creating extra items by dicts and iterables, along with tests of the failure conditions.

Make sure that ndarrays are one-dimensional. Move check for bare string. Clarify error messages. Reduce indentation. Add tests for creating extra items from bare strings and ndarrays.

This change bumps the builder that uses the system Python 2 to run SCons and the system Python 3 for the Python interface from Ubuntu 16.04 to Ubuntu 18.04. The reason for this change is that Ubuntu 16.04 provides NumPy 1.11, which does not support flexible dtypes when creating arrays using .full(). We have decided to drop support for NumPy older than 1.12 for this reason. NumPy 1.12 was released in January 2017.

If the extra columns are passed to the SolutionArray, they must either have initial values or the SolutionArray must be empty.

ischoegl reviewed Apr 11, 2020

sin-ha force-pushed the Fix#829 branch from f630bce to 4c5da92 Compare April 13, 2020 18:44

ischoegl reviewed Apr 13, 2020

bryanwweber reviewed Apr 14, 2020

sin-ha force-pushed the Fix#829 branch from a6a8bfd to ceee08d Compare April 14, 2020 14:28

sin-ha force-pushed the Fix#829 branch from ceee08d to 18a1c1f Compare May 17, 2020 13:34

bryanwweber requested changes May 29, 2020

speth changed the base branch from master to main June 30, 2020 23:14

sin-ha force-pushed the Fix#829 branch 3 times, most recently from edebe9b to d9ec808 Compare July 2, 2020 12:08

bryanwweber requested changes Jul 2, 2020

sin-ha force-pushed the Fix#829 branch from 1f907ca to 1b57f6d Compare July 3, 2020 19:08

bryanwweber mentioned this pull request Jul 6, 2020

Allow SolutionArray to append more than one state at a time Cantera/enhancements#57

Open

bryanwweber requested a review from speth July 6, 2020 18:37

bryanwweber mentioned this pull request Jul 8, 2020

Make arrays of computed properties on Solution and SolutionArray immutable Cantera/enhancements#58

Open

ischoegl mentioned this pull request Jul 8, 2020

Sequences can be appended as extra items on a SolutionArray #895

Closed

ischoegl reviewed Jul 8, 2020

ischoegl reviewed Jul 8, 2020

bryanwweber force-pushed the Fix#829 branch from 253aaf2 to c922c1b Compare July 15, 2020 14:10

bryanwweber added a commit to bryanwweber/cantera that referenced this pull request Jul 15, 2020

[Cython] Enforce a minimum NumPy version

42ce1ec

bryanwweber added a commit to bryanwweber/cantera that referenced this pull request Jul 23, 2020

[Cython] Enforce a minimum NumPy version

46d8ff7

speth pushed a commit that referenced this pull request Jul 27, 2020

[Cython] Enforce a minimum NumPy version

6edb133

The minimum NumPy version is 1.12 to support the use of numpy.full() in the SolutionArray interface introduced in #838.

ischoegl mentioned this pull request Aug 6, 2020

Support file I/O for string columns in SolutionArray #900

Merged

4 tasks

speth requested changes Aug 21, 2020

sin-ha and others added 13 commits August 21, 2020 20:12

Fixes indexing in SolutionArray

0904052

Include initialisation of extra columns with any iterable

d3d33dc

Remove redundant Numpy conversion

1c5eb09

Minor formatting updates to SolutionArray

b3809e5

Fix append method for SolutionArray

d0bf649

Add sin-ha to AUTHORS

c4187f0

[Thermo] Format of KeyErrors in SolutionArray

11f0b9c

Fix the formatting of the two KeyErrors in SolutionArray.append(). The first one was missing the format method on the string. The second was raising a double-KeyError due to the way the except is handled.

[Thermo] Append to SolutionArray all at once

fb4557c

Move append of extra values to the end of SolutionArray.append(). If the append is done at the beginning of the function, the append can happen even if the state is invalid. This would cause the length of the arrays to become out of sync.

[Test] Rename assigning to slice test

ce73776

[Thermo] Clarify extra items from iterables

c1f30c9

Make sure that ndarrays are one-dimensional. Move check for bare string. Clarify error messages. Reduce indentation. Add tests for creating extra items from bare strings and ndarrays.

[Thermo] Remove unused variables in SolutionArray

bb8ee9d

bryanwweber force-pushed the Fix#829 branch from 0a48784 to 70eeb26 Compare August 22, 2020 00:13

bryanwweber force-pushed the Fix#829 branch from 70eeb26 to ea662f0 Compare August 22, 2020 00:17

bryanwweber approved these changes Aug 22, 2020

[SolutionArray] Disallow empty extra columns

8d88fee

If the extra columns are passed to the SolutionArray, they must either have initial values or the SolutionArray must be empty.

bryanwweber force-pushed the Fix#829 branch from ea662f0 to 8d88fee Compare August 31, 2020 23:40

speth approved these changes Sep 1, 2020

speth merged commit 3f455e1 into Cantera:main Sep 1, 2020
